
Add managed-memory advise, prefetch, and discard-prefetch free functions#1775

Open
rparolin wants to merge 74 commits into NVIDIA:main from rparolin:rparolin/managed_mem_advise_prefetch

Conversation

@rparolin
Collaborator

@rparolin rparolin commented Mar 17, 2026

Summary

Adds managed-memory range operations to cuda.core:

  • Free functions in cuda.core.utils: advise, prefetch, discard, discard_prefetch. Each accepts either a single Buffer or a sequence; N==1 dispatches to the per-range driver entry point and N>1 dispatches to the corresponding cuMem*BatchAsync (CUDA 13+).
  • Host — new top-level singleton class symmetric to Device. Host() (any host), Host(numa_id=N), Host.numa_current(). Same-argument constructions are interned (Host() is Host()). Used together with Device to express managed-memory locations.
  • ManagedBuffer — Buffer subclass returned by ManagedMemoryResource.allocate. Exposes a Pythonic property-style advice API on top of the same free functions. Wrap an external managed pointer with Buffer.from_handle(...) (now a @classmethod, so ManagedBuffer.from_handle(...) returns a ManagedBuffer).

Closes #1332. Addresses the managed-memory portion of #1333 (P1: cuMemPrefetchBatchAsync, cuMemDiscardBatchAsync, cuMemDiscardAndPrefetchBatchAsync). The P0 cuMemcpyBatchAsync from #1333 is intentionally out of scope and tracked separately; the holistic batched-API contract this PR commits to is documented in #issuecomment-4355502334 so the upcoming cuMemcpyBatchAsync work can mirror it.

Public API

ManagedBuffer — property-style advice on managed allocations

ManagedMemoryResource.allocate returns a ManagedBuffer (a Buffer subclass). All ManagedBuffer-specific behavior is layered on top of the free functions, so the two surfaces stay consistent.

from cuda.core import Device, Host, ManagedMemoryResource

mr = ManagedMemoryResource()
buf = mr.allocate(size)                # ManagedBuffer

# Driver-backed properties — getter queries the driver, setter calls cuMemAdvise.
buf.read_mostly = True
buf.preferred_location = Device(0)     # or Host(), or Host(numa_id=N)
buf.preferred_location = None          # unset

# Live set-like view of `set_accessed_by` advice.
buf.accessed_by.add(Device(1))
buf.accessed_by.discard(Device(1))
buf.accessed_by = {Device(0), Device(1)}   # diff vs current; advise only deltas

# Instance methods delegate to the matching free functions.
buf.prefetch(Device(0), stream=stream)
buf.discard(stream=stream)
buf.discard_prefetch(Device(0), stream=stream)

Note: on CUDA 13 builds, preferred_location round-trips full NUMA detail via the v2 attribute (Host(numa_id=N) and Host.numa_current() are preserved on read-back). On CUDA 12 builds, the legacy cuMemRangeGetAttribute query path returns integer device ordinals, so Host(numa_id=...) collapses to a generic Host() on read-back. Setters preserve full NUMA information when issuing advice on both toolkits.

Free functions — advise / prefetch / discard / discard_prefetch

Each accepts a Buffer (or ManagedBuffer) or a sequence of them. Locations are expressed via Device or Host.

from cuda.core import Device, Host
from cuda.core.utils import advise, prefetch, discard, discard_prefetch

# Stage to GPU, kernel, bring back to host
prefetch(buf, Device(0), stream=stream)
launch_my_kernel(buf, stream=stream)
prefetch(buf, Host(), stream=stream)
stream.sync()
result = bytes(buf)

# Advice
advise(weights, "set_read_mostly")
advise(activations, "set_preferred_location", Device(0))
advise(scratch, "set_accessed_by", Device(0))

# Discard / discard+prefetch (CUDA 13+)
discard(scratch, stream=stream)
for step in range(num_steps):
    discard_prefetch(activations, Device(0), stream=stream)
    run_forward(activations, stream=stream)

Batched form — same function, sequence of targets

When N>1, dispatch goes to the corresponding cuMem*BatchAsync. Sequence locations are paired by index; a scalar location broadcasts to every target.

# Pair-by-index: output → GPU 0, log_metrics → host
prefetch(
    [output, log_metrics],
    [Device(0), Host()],
    stream=stream,
)

# Scalar broadcast: every shard moves to GPU 0
prefetch([shard_a, shard_b, shard_c], Device(0), stream=stream)

Mismatched sequence lengths raise ValueError. On a CUDA 12 build of cuda.core, N>1 raises NotImplementedError (the *BatchAsync entry points are CUDA 13+); N==1 works on every supported toolkit.

Putting it together

weights = mr.allocate(weights_size)    # ManagedBuffer
inputs  = mr.allocate(inputs_size)
output  = mr.allocate(output_size)

# One-time hints (property API on ManagedBuffer)
weights.read_mostly = True
weights.preferred_location = Device(0)
output.preferred_location = Device(0)

# Per inference
inputs.prefetch(Device(0), stream=stream)
run_inference(weights, inputs, output, stream=stream)
output.prefetch(Host(), stream=stream)
stream.sync()

Implementation notes

  • Cython implementation in cuda_core/cuda/core/_memory/_managed_memory_ops.pyx uses cimport cydriver for direct C-level driver calls.
  • The CUDA 12 / 13 ABI split for cuMemAdvise and cuMemPrefetchAsync is handled at compile time with IF CUDA_CORE_BUILD_MAJOR >= 13: / ELSE:.
  • Batched entry points (cuMemPrefetchBatchAsync, cuMemDiscardBatchAsync, cuMemDiscardAndPrefetchBatchAsync) are CUDA 13+ only. On CUDA 12 builds, N>1 calls raise NotImplementedError; single-buffer calls work everywhere.
  • Host is a singleton with __slots__ and a __new__-based intern cache keyed by (numa_id, is_numa_current). Same-argument constructions return the same instance on both Python and Cython call paths.
  • ManagedBuffer is a pure-Python subclass of the Cython Buffer cdef class. Buffer.from_handle is now a @classmethod (was @staticmethod) so MyBufferSubclass.from_handle(...) returns the typed instance via cls._init. Buffer_from_deviceptr_handle and _MP_allocate thread an optional cls parameter so ManagedMemoryResource.allocate materializes a ManagedBuffer.
  • Internal _LocSpec (in _managed_location.py) carries the (kind, id) discriminator that the Cython layer maps to CUmemLocation (CUDA 13) or a legacy device ordinal (CUDA 12). Public callers see only Device / Host; _coerce_location produces the internal record.
  • _buffer.pyx collapses out.is_managed = (is_managed != 0) to a single unconditional assignment and adds a TODO noting that HMM/ATS-mapped sysmem is not yet captured by CU_POINTER_ATTRIBUTE_IS_MANAGED.
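The `__new__`-based intern cache described in the implementation notes can be sketched as follows. This is a simplified illustration, not the real class: the actual Host also supports `Host.numa_current()`, `__reduce__` for pickling, and identity-based `__eq__`.

```python
import threading

class Host:
    """Sketch of a __new__-based intern cache keyed on (numa_id, is_numa_current)."""
    __slots__ = ("numa_id", "is_numa_current")
    _instances = {}
    _instances_lock = threading.Lock()

    def __new__(cls, numa_id=None, *, _numa_current=False):
        if isinstance(numa_id, bool):  # bool is an int subclass; reject explicitly
            raise TypeError("numa_id must be an int, not bool")
        key = (numa_id, _numa_current)
        inst = cls._instances.get(key)
        if inst is None:
            with cls._instances_lock:  # double-checked locking
                inst = cls._instances.get(key)
                if inst is None:
                    inst = super().__new__(cls)
                    inst.numa_id = numa_id
                    inst.is_numa_current = _numa_current
                    cls._instances[key] = inst
        return inst

    def __repr__(self):
        return "Host()" if self.numa_id is None else f"Host(numa_id={self.numa_id})"
```

With this pattern, same-argument constructions are identical objects, so `Host() is Host()` holds and `is` can be used for identity checks.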

@copy-pr-bot
Contributor

copy-pr-bot Bot commented Mar 17, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@rparolin rparolin requested a review from Andy-Jost March 17, 2026 00:41
@rparolin rparolin self-assigned this Mar 17, 2026
@rparolin rparolin added this to the cuda.core v0.7.0 milestone Mar 17, 2026
@rparolin rparolin marked this pull request as ready for review March 17, 2026 00:45
@rparolin rparolin marked this pull request as draft March 17, 2026 00:45
@rparolin rparolin changed the title wip Add managed-memory advise, prefetch, and discard-prefetch on Buffer Mar 17, 2026
@rparolin rparolin marked this pull request as ready for review March 17, 2026 00:57
@rparolin
Collaborator Author

/ok to test

@jrhemstad

question: Does making these member functions of the Buffer type preclude this functionality for allocations that weren't created through the Buffer type? Did we consider making these free functions instead of member functions on the Buffer type?

@rparolin
Collaborator Author

rparolin commented Mar 17, 2026

question: Does making these member functions of the Buffer type preclude this functionality for allocations that weren't created through the Buffer type? Did we consider making these free functions instead of member functions on the Buffer type?

I'm moving this back into draft. We discussed this in our team meeting; I was already hesitant because Buffer is becoming a 'God object' with all the functionality it is gaining, and we were going to explore alternatives. Free functions sound like a good alternative to explore.

@rparolin rparolin marked this pull request as draft March 17, 2026 19:35
@rparolin rparolin marked this pull request as ready for review March 17, 2026 23:46
rparolin and others added 7 commits March 17, 2026 17:30
…ups, fix docs

- Remove duplicate long-form "cu_mem_advise_*" string aliases from
  _MANAGED_ADVICE_ALIASES; users pass short strings or the enum directly
- Replace 4 boolean allow_* params in _normalize_managed_location with a
  single allowed_loctypes frozenset driven by _MANAGED_ADVICE_ALLOWED_LOCTYPES
- Cache immutable runtime checks: CU_DEVICE_CPU, v2 bindings flag,
  discard_prefetch support, and advice enum-to-alias reverse map
- Collapse hasattr+getattr to single getattr in _managed_location_enum
- Move _require_managed_discard_prefetch_support to top of discard_prefetch
  for fail-fast behavior
- Fix docs build: reset Sphinx module scope after managed_memory section in
  api.rst so subsequent sections resolve under cuda.core
- Add discard_prefetch pool-allocation test and comment on _get_mem_range_attr

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…e legacy path

The _V2_BINDINGS cache in _buffer.pyx persists across tests, so
monkeypatching get_binding_version alone is insufficient when earlier
tests have already populated the cache with the v2 value. Promote
_V2_BINDINGS from cdef int to a Python-level variable so tests can
monkeypatch it directly via monkeypatch.setattr, and reset it to -1
in both legacy-signature tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…t real hardware

These three tests call cuMemAdvise on real CUDA devices and verify
memory range attributes. On devices without concurrent_managed_access
(e.g. Windows/WDDM), set_read_mostly silently no-ops and
set_preferred_location fails with CUDA_ERROR_INVALID_DEVICE. Use the
stricter _skip_if_managed_location_ops_unsupported guard, matching the
pattern already used by test_managed_memory_functions_accept_raw_pointer_ranges.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…s support

Reorder checks in discard_prefetch so _normalize_managed_target_range
runs before _require_managed_discard_prefetch_support. This ensures
non-managed buffers raise ValueError before the RuntimeError for missing
cuMemDiscardAndPrefetchBatchAsync support.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ps module

Move advise, prefetch, and discard_prefetch functions and their helpers
out of _buffer.pyx into a new _managed_memory_ops Cython module to
improve separation of concerns. Expose _init_mem_attrs and
_query_memory_attrs as non-inline cdef functions in _buffer.pxd so the
new module can reuse them.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
api.rst still listed the single-buffer free functions and *Options
dataclasses that were removed under R9/R11 (advise, prefetch, discard,
discard_prefetch and their *Options classes). Replace with the actual
cuda.core.utils exports: prefetch_batch, discard_batch,
discard_prefetch_batch. Drop the now-orphan :template: dataclass.rst
line.
@rparolin
Collaborator Author

rparolin commented May 1, 2026

Retraction on issue 2 above.

After verification: issue 2 (cu12 ELSE branches calling broken cydriver signatures) is a false positive. CI builds cu12 cuda_core against cuda_bindings artifacts from the `12.9.x` backport branch (per `build-wheel.yml` and `ci/versions.yml`), and that branch's `cydriver.pxd.in` still exposes the v1 signatures:

```cython
cdef CUresult cuMemPrefetchAsync(CUdeviceptr, size_t, CUdevice, CUstream)
cdef CUresult cuMemAdvise(CUdeviceptr, size_t, CUmem_advise, CUdevice)
```

Those match the 4-arg int-device calls in `_managed_memory_ops.pyx` ELSE branches. The mismatch I flagged is only against `main`'s pxd.in, which post-cu13-cutover has been narrowed to v2 wrappers — but that's not what cu12 cuda_core compiles against. Sorry for the noise.

Issue 1 (api.rst phantom symbols) still stands and is fixed in 4c228eb.

@rparolin rparolin requested a review from rwgk May 4, 2026 16:24
@Andy-Jost
Contributor

  1. Test coverage gaps — CUDA 12 batch fallback, AccessedBySet iteration, stream=None, error message assertions

Regarding AccessedBySet, to fill gaps and align with the graph module, I suggest:

  1. Rename AccessedBySet to AccessedBySetProxy. This is consistent with AdjacencySetProxy and more accurately describes the role of this object.
  2. Inherit from collections.abc.MutableSet to fill in the full set interface.
  3. Use assert_mutable_set_interface to test the full interface.

rparolin added 2 commits May 4, 2026 15:18
Mirror Device's singleton semantics so Host() is Host() and
Host(numa_id=1) is Host(numa_id=1) hold. Host.numa_current() returns
its own singleton, distinct from Host(), since it represents a
thread-relative location rather than a fixed one.

Construction routes through __new__ -> _get_or_create with a
double-checked dict + Lock cache keyed on (numa_id, is_numa_current).
__eq__ collapses to identity (consistent with the retained __hash__).
__reduce__ added so pickled Host instances round-trip back through
the singleton cache instead of stranding copies.

Resolves PR NVIDIA#1775 review: leofang and Andy-Jost requested Host follow
Device as a singleton so users can rely on `is` for identity checks.
Align with the graph module's AdjacencySetProxy: rename the class and
inherit from collections.abc.MutableSet so the full set interface
(remove, pop, clear, |=, &=, -=, ^=, isdisjoint, subset/superset
operators, etc.) is filled in automatically from the existing add /
discard / __contains__ / __iter__ / __len__ primitives.

Add classmethod _from_iterable so binary set operators (&|^) produce
plain sets rather than constructing a buffer-less proxy. Tighten add
to TypeError on non-Device/Host inputs and discard / __contains__ to
silently ignore them, matching MutableSet contracts. The hand-rolled
__eq__ (set/frozenset comparison) is dropped: Set ABC's default
implementation handles it correctly.

Resolves PR NVIDIA#1775 review (Andy-Jost, 2026-05-04): naming consistency
with AdjacencySetProxy and full MutableSet conformance.
@rparolin
Collaborator Author

rparolin commented May 4, 2026

@Andy-Jost Done. Resolved by 7126324.

  • Renamed AccessedBySet -> AccessedBySetProxy (consistent with AdjacencySetProxy).
  • Inherits from collections.abc.MutableSet; the auto-derived methods (remove, pop, clear, |=, &=, -=, ^=, isdisjoint, subset/superset operators) now work against the live driver state.
  • Added _from_iterable so &|^ produce plain sets rather than buffer-less proxies; tightened add (TypeError on non-Device/Host) and discard/__contains__ (silently ignore non-Device/Host) to match MutableSet contracts.
  • Dropped assert_mutable_set_interface test (would skip on single-GPU CI; happy to add it if multi-GPU coverage exists).
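The proxy pattern summarized above — implement the five primitives and let `collections.abc.MutableSet` derive the rest — can be sketched independently of the driver. Here a plain backing set stands in for the live `cuMemAdvise`-backed state; the real proxy issues advice on `add`/`discard` instead.

```python
from collections.abc import MutableSet

class AccessedBySetProxy(MutableSet):
    """Sketch: five primitives, with the full MutableSet interface
    (remove, pop, clear, |=, &=, -=, ^=, isdisjoint, ...) derived for free."""

    def __init__(self, initial=()):
        self._state = set(initial)  # stands in for live driver state

    def __contains__(self, item):
        return item in self._state

    def __iter__(self):
        return iter(self._state)

    def __len__(self):
        return len(self._state)

    def add(self, item):
        self._state.add(item)       # real proxy issues set_accessed_by advice

    def discard(self, item):
        self._state.discard(item)   # real proxy issues unset_accessed_by advice

    @classmethod
    def _from_iterable(cls, it):
        # Binary operators (&, |, ^, -) build plain sets rather than
        # constructing a buffer-less proxy.
        return set(it)
```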

- Annotate _instances / _instances_lock as ClassVar (RUF012).
- Sort __slots__ alphabetically (RUF023, auto-fixed by ruff).
@rparolin
Collaborator Author

rparolin commented May 4, 2026

@leofang ping for a re-review. Your CHANGES_REQUESTED from 2026-04-30 is still active, but every inline thread has since been addressed (see the per-thread "Feedback addressed in " replies on 2026-05-01). The two follow-up items from this week — Host singleton (5743e05) and AccessedBySetProxy / MutableSet conformance (7126324) — are also done. Could you take another pass when you have time, so the request can be dismissed if it looks good?

@leofang
Member

leofang commented May 4, 2026

@rparolin as discussed earlier offline, would it be possible if we merge all P0 PRs first? I have a very long review backlog...

Also, if @Andy-Jost has completed re-review, a new approval should be stamped to make it clear.

@rparolin
Collaborator Author

rparolin commented May 4, 2026

@rparolin as discussed earlier offline, would it be possible if we merge all P0 PRs first? I have a very long review backlog...

Also, if @Andy-Jost has completed re-review, a new approval should be stamped to make it clear.

Yup. As discussed, this is not a priority. But the changes are ready for you to review whenever you are available.

Contributor

@rwgk rwgk left a comment


The PR title and description seem to be very out-of-date. It looks like it'll be best to discard and re-generate from scratch (with an independent fresh agent that isn't polluted by the history with many revisions).

For the title, how about:

Add managed-memory ManagedBuffer class with advise / prefetch / discard-prefetch APIs, plus batched free functions

Comment thread cuda_core/cuda/core/_memory/_managed_memory_ops.pyx Outdated
Comment thread cuda_core/cuda/core/_memory/_managed_memory_ops.pyx Outdated
@rwgk
Contributor

rwgk commented May 5, 2026

I switched back to using Cursor GPT-5.4 Extra High Fast (after feeling very frustrated with the undisclosed other Cursor model I used before).

GPT-5.4 Extra High Fast was thinking significantly longer, but also gives far more concise findings without any further prompting:

Findings

  • High: cuda_core/cuda/core/_memory/_managed_buffer.py:208 updates accessed_by incrementally, while full validation still happens later in cuda_core/cuda/core/_memory/_managed_memory_ops.pyx:233. If the new RHS contains any invalid entry, the setter can remove/add some advice and then fail, leaving partially applied state. I reproduced this locally: starting from {Device(0)}, assigning {Host(numa_id=0)} raises ValueError and leaves accessed_by == set().

  • Medium: CUDA 12 handling is incoherent for NUMA-specific host locations. _coerce_location() accepts Host(numa_id=...) / Host.numa_current(), but the CUDA 12 backend in cuda_core/cuda/core/_memory/_managed_memory_ops.pyx:149 can only serialize "device" and generic "host". That means public APIs such as ManagedBuffer.preferred_location in cuda_core/cuda/core/_memory/_managed_buffer.py:196, ManagedBuffer.prefetch() in cuda_core/cuda/core/_memory/_managed_buffer.py:218, and the CUDA 12 prefetch_batch() path all fail late with RuntimeError instead of handling or rejecting those inputs up front. The current ManagedBuffer docstring also implies the setter still works on CUDA 12 and only read-back is lossy, which is not what this implementation does.

  • Low: cuda_core/cuda/core/_host.py:38 accepts bool as numa_id, and the singleton cache in cuda_core/cuda/core/_host.py:45 then aliases Host(True) with Host(1) and Host(False) with Host(0). Whichever call happens first seeds the cached instance, so Host(1).numa_id can become True and repr(Host(1)) can become Host(numa_id=True).


The Medium finding seems like the least actionable one, the other two seem worth drilling down into.
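The Low finding reproduces in plain Python, since `bool` is an `int` subclass with matching hash and equality — any dict-keyed cache that accepts a bool `numa_id` will alias the bool and int entries. The `intern` helper below is a hypothetical stand-in for the singleton cache, not cuda.core code.

```python
# True == 1 and hash(True) == hash(1), so a dict cache keyed on numa_id
# aliases Host(True) with Host(1) unless bool is rejected explicitly.
cache = {}

def intern(numa_id):
    # Stand-in for a singleton cache keyed on numa_id.
    return cache.setdefault(numa_id, object())

assert isinstance(True, int)
assert hash(True) == hash(1)
assert intern(True) is intern(1)  # whichever call lands first seeds the entry
```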

@rparolin
Collaborator Author

rparolin commented May 5, 2026

The PR title and description seem to be very out-of-date. It looks like it'll be best to discard and re-generate from scratch (with an independent fresh agent that isn't polluted by the history with many revisions).

For the title, how about:

Add managed-memory ManagedBuffer class with advise / prefetch / discard-prefetch APIs, plus batched free functions

Doh! Thank you. I'll regenerate the description now.

rparolin added 5 commits May 5, 2026 13:32
bool is an int subclass, so the previous guard let Host(True) and
Host(False) seed the singleton cache under the same keys as Host(1)
and Host(0). Whichever call landed first won, leaving repr(Host(1))
potentially showing as Host(numa_id=True). Reject bool explicitly.

Addresses rwgk's Low finding on PR NVIDIA#1775.
Move _require_managed_buffer to the first statement of _advise_one so
a non-managed buffer is rejected before advice/location parsing,
matching the order in _do_single_prefetch_py and
_do_single_discard_prefetch_py. This prevents surfacing an
advice-validation error when the real problem is the buffer kind.
Rephrase the RuntimeError raised from _to_legacy_device when a caller
passes Host(numa_id=...) or Host.numa_current() on a CUDA 12 build.
The new message names the unsupported APIs and points the user at
Host() as the working alternative, instead of leaking the internal
location_type discriminator.
The CUDA 12 cuMemPrefetchAsync / cuMemAdvise ABI takes a plain device
ordinal and cannot represent a specific host NUMA node. Previously
_coerce_location accepted Host(numa_id=...) and Host.numa_current()
on a CUDA 12 build and let the operation fail late inside the Cython
layer with RuntimeError, which the public APIs surfaced as a confusing
error from deep in the stack.

Reject NUMA-host kinds at the call boundary in _coerce_location with
a TypeError that names the unsupported APIs and points at Host() as
the working alternative. Update the ManagedBuffer docstring to match
the new contract, and broaden two host_numa-rejection test asserts to
accept either the CUDA 13 kind-allowed ValueError or the CUDA 12
boundary TypeError.

Addresses rwgk's Medium finding on PR NVIDIA#1775.
The previous setter computed (current - target) and (target - current)
and called _advise_one in two loops. set(locations) raised TypeError
on unhashable elements, but only after the first diff pair had already
been issued, so an invalid RHS could leave accessed_by partially
mutated. Reproduce: starting from {Device(0)}, assigning
{Host(numa_id=0)} on CUDA 12 raises and leaves accessed_by == set().

Validate every target up-front (per-element isinstance(Device|Host))
and only then issue the diff loops, so a bad RHS raises before any
driver state changes.

Addresses rwgk's High finding on PR NVIDIA#1775.
@rparolin
Collaborator Author

rparolin commented May 5, 2026

@rwgk re your High finding (accessed_by setter applying diffs incrementally before validation):

Done. 1b66367.

The setter now validates every RHS element (isinstance(Device | Host)) before issuing any cuMemAdvise, so a bad target raises TypeError cleanly instead of leaving accessed_by partially mutated.

@rparolin
Collaborator Author

rparolin commented May 5, 2026

@rwgk re your Medium finding (CUDA 12 incoherence with Host(numa_id=...) / Host.numa_current()):

Done. bcc056b.

_coerce_location now rejects NUMA-host kinds at the call boundary on CUDA 12 with TypeError("... require a CUDA 13 build of cuda.core; use Host() on CUDA 12") — no more late RuntimeError from _to_legacy_device. The ManagedBuffer docstring is updated to match the new contract, and two host_numa-rejection test asserts now accept either the CUDA 13 kind-allowed ValueError or the CUDA 12 boundary TypeError.

@rparolin
Collaborator Author

rparolin commented May 5, 2026

@rwgk re your Low finding (Host(True) aliasing Host(1) in the singleton cache):

Done. d0b6621.

Host.__new__ now rejects bool explicitly (isinstance(numa_id, bool) short-circuits before the int check, since bool is an int subclass). Added TestHost.test_numa_id_rejects_bool covering both True and False.

rparolin added 3 commits May 5, 2026 13:40
Collapses multi-line string concats and conditions back to single lines
under the project's line-length limit. No behavior change.
…m_advise_prefetch

# Conflicts:
#	cuda_core/docs/source/release/1.0.0-notes.rst
Host(numa_id=N) and Host.numa_current() require CUDA 13 bindings; the
TestLocationCoerce passthroughs were missing the binding_version guard
already used by test_preferred_location_roundtrip_host_numa.
@rparolin rparolin enabled auto-merge (squash) May 5, 2026 23:08

Labels

cuda.core — Everything related to the cuda.core module · feature — New feature or request · P1 — Medium priority, should do

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support managed memory advise, prefetch, and discard-prefetch

6 participants